Abstract: In many signal detection and classification problems,
we have knowledge of the distribution under each hypothesis,
but not the prior probabilities. This paper provides theory to
quantify the performance of detection via estimating prior
probabilities from either labeled or unlabeled training data.
The error or risk is considered as a function of the prior
probabilities. We show that the risk function is locally Lipschitz
in the vicinity of the true prior probabilities, and that the error
of detectors based on estimated prior probabilities depends on the
behavior of the risk function in this locality. In general, we show
that the error of detectors based on the Maximum Likelihood
Estimate (MLE) of the prior probabilities converges to the Bayes
error at a rate of $n^{-1/2}$, where $n$ is the number of training
data. If the behavior of the risk function is more favorable, then
detectors based on the MLE have errors converging to the
corresponding Bayes errors at optimal rates of the form
$n^{-(1+\alpha)/2}$, where $\alpha > 0$ is a parameter governing the
behavior of the risk function, with a typical value $\alpha = 1$.
The limit $\alpha \to \infty$ corresponds to a situation where the
risk function is flat near the true probabilities, and thus
insensitive to small errors in the MLE; in this case the error of
the detector based on the MLE converges to the Bayes error
exponentially fast with $n$. We show that the bounds are achievable
with either labeled or unlabeled training data and are
minimax-optimal in the labeled case.

Index Terms: Detector, minimax-optimality, maximum likelihood
estimate (MLE), prior probability, statistical learning theory



I. Introduction 

IN many signal detection and classification problems the 
conditional distribution under each hypothesis is known, 
but the prior probabilities are unknown. For example, we may 
have a good model for the symptoms of a certain disease, but 
might not know how prevalent the disease is. There are two 
ways to proceed: 

1) Neyman-Pearson detectors 

2) Estimate prior probabilities from training data 

Neyman-Pearson detectors are designed to control one type
of error while minimizing the other. Detectors based on
estimating prior probabilities aim to achieve the performance of
the Bayes detector (see, e.g., Devroye, Györfi, and Lugosi [1]).
We study this second approach and provide theory to quantify
the performance of detectors based on estimating prior
probabilities from training data. We will focus on simple binary
hypotheses and minimum probability of error detection, but
the theory and methods can be extended to handle other error
criteria that weight different error types and to m-ary detection
problems. This problem can be viewed as a special case of
the classification problem in machine learning in which we
have knowledge of the density under each hypothesis. These
conditional densities are called the class-conditional densities
in the parlance of machine learning, and we will use this
terminology here. Detectors based on "plugging in" the Maximum
Likelihood Estimate (MLE) of the prior probabilities are
simply a special case of the well-known plug-in approach in
statistical learning theory. We use this connection to develop
upper and lower bounds on the performance of detectors based
on the MLE of prior probabilities.

Let us first introduce some notation for the problem. Let
$X \in \mathbb{R}^d$ denote a signal and consider a binary
hypothesis testing problem

$$H_0 : X \sim p_0, \qquad H_1 : X \sim p_1,$$

where $p_0$ and $p_1$ are known probability densities on
$\mathbb{R}^d$. Let $Y$ be a binary random variable indicating which
hypothesis $X$ follows, and define $q := P(Y = 1)$, the probability
that hypothesis $H_1$ is true. The Bayes detector is defined by the
likelihood ratio test

$$\frac{p_1(X)}{p_0(X)} \underset{H_0}{\overset{H_1}{\gtrless}} \frac{1-q}{q},$$

and it minimizes the probability of error.

Let $\Lambda(x) := p_1(x)/p_0(x)$ and define the regression function
$\eta(x)$:

$$\eta(x) := P(Y = 1 \mid X = x) = \frac{q\,p_1(x)}{(1-q)\,p_0(x) + q\,p_1(x)};$$

then the Bayes detector can be expressed as

$$f^*(x) = \mathbf{1}_{\{\eta(x) \geq 1/2\}}.$$



Note that $\eta(x)$ is parameterized by the prior probability $q$.
Let us consider the probability of error, or risk, as a function of
this parameter. For any feasible prior probability $q'$, let $R(q')$
denote the risk (probability of error) incurred by using $q'$ in
place of $q$. The value $q$ defined above produces the minimum
risk. The difference $R(q') - R(q)$ quantifies the suboptimality
of $q'$. The quantity $R(q')$ can be expressed as

$$R(q') = q\,P_1(q') + (1-q)\,P_0(q'),$$

where

$$P_1(q') = P\big(\Lambda(x) < (1-q')/q' \mid H_1\big) = \int \mathbf{1}_{\{\Lambda(x) < (1-q')/q'\}}\, p_1(x)\,dx,$$

$$P_0(q') = P\big(\Lambda(x) \geq (1-q')/q' \mid H_0\big) = \int \mathbf{1}_{\{\Lambda(x) \geq (1-q')/q'\}}\, p_0(x)\,dx.$$
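The risk function above can be evaluated in closed form for a simple pair of densities. The following is a minimal sketch, assuming the hypothetical choice $p_0 = N(0,1)$ and $p_1 = N(\mu,1)$ with $\mu = 2$ (not taken from the paper); the function names `phi` and `risk` are illustrative only. For this pair the likelihood ratio test reduces to a threshold on $x$, and a grid search confirms that $R(q')$ is minimized at $q' = q$.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def risk(q_used, q_true, mu=2.0):
    """R(q') for the assumed pair p0 = N(0,1), p1 = N(mu,1): error
    probability when the test uses q_used but the true prior is q_true."""
    if q_used <= 0.0:              # threshold (1-q')/q' is infinite: always decide H0
        return q_true
    if q_used >= 1.0:              # threshold is zero: always decide H1
        return 1.0 - q_true
    # Lambda(x) = exp(mu*x - mu^2/2), so Lambda(x) >= (1-q')/q'  <=>  x >= x_thr
    x_thr = mu / 2.0 + math.log((1.0 - q_used) / q_used) / mu
    p1_miss = phi(x_thr - mu)      # P_1(q') = P(Lambda < threshold | H1)
    p0_false = 1.0 - phi(x_thr)    # P_0(q') = P(Lambda >= threshold | H0)
    return q_true * p1_miss + (1.0 - q_true) * p0_false

q = 0.3
grid = [i / 100 for i in range(1, 100)]
best = min(grid, key=lambda qp: risk(qp, q))   # grid point with the smallest risk
```

Here `best` is the grid value of $q'$ with the smallest risk, and `risk(qp, q) - risk(q, q)` is the suboptimality $R(q') - R(q)$ discussed in the text.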



Assume there is a joint distribution $\pi = \pi_{XY}$ over the
signal $X$ and label $Y$. This distribution determines both the
class-conditional densities (by conditioning on $Y = 0$ or
$Y = 1$) and the prior probabilities (by marginalizing over
$X$). Suppose we have $n$ training data distributed independently
and identically according to $\pi$. We will consider cases with
"labeled" $\{(X_i, Y_i)\}_{i=1}^n$ or "unlabeled" $\{X_i\}_{i=1}^n$
data and use them to estimate the unknown prior probability $q$.
Let $\hat{q}$ stand for the MLE of $q$ based on training data; the
risk of the detector based on $\hat{q}$ is $R(\hat{q})$. Note that
$R(\hat{q})$ is a random variable and it is greater than or equal to
$R(q)$. The goal of this paper is to bound the difference
$E[R(\hat{q})] - R(q)$, where $E$ is the expectation operator, and
to provide lower bounds on the performance of any detector derived
from knowledge of the class-conditional densities and the training
data. The difference $E[R(\hat{q})] - R(q)$ is usually called the
excess risk or regret, and it is a function of $n$.

Statistical learning theory is typically concerned with the
construction of estimators based on labeled training data without
prior knowledge of class-conditional densities. There
are two common approaches: plug-in rules and empirical
risk minimization (ERM) rules (see, e.g., Devroye, Györfi,
and Lugosi [1] and Vapnik [2]). Statistical properties of these
two types of classifiers, as well as of other related ones,
have been extensively studied (see Aizerman, Braverman,
and Rozonoer [3], Vapnik and Chervonenkis [4], Vapnik [2], [5],
Breiman, Friedman, Olshen, and Stone [6], Devroye, Györfi,
and Lugosi [1], Anthony and Bartlett [7], Cristianini and
Shawe-Taylor [8] and Schölkopf and Smola [9] and the
references therein). Results concerning the convergence of the
excess risk obtained in the literature are of the form

$$E[R(\hat{f}_n)] - R(f^*) = O(n^{-\beta}),$$

where $\beta > 0$ is some exponent, and typically $\beta \leq 1/2$ if
$R(f^*) \neq 0$. Here $\hat{f}_n$ denotes the nonparametric estimator
of the classifier and $f^*$ denotes the Bayes classifier. Mammen
and Tsybakov [10] first showed that one can attain fast rates,
approaching $n^{-1}$; for further results about the fast rates
see Koltchinskii [11], Steinwart and Scovel [12], Tsybakov and
van de Geer [13], Massart [14] and Catoni [15]. The behavior
of the regression function $\eta$ around the boundary
$\partial G^* = \{x : \eta(x) = 1/2\}$ has an important effect on the
convergence of the excess risk, which has been discussed earlier
under different assumptions by Devroye, Györfi, and Lugosi [1] and
Horváth and Lugosi [16]. In this paper, we consider the
"margin assumption" introduced in Tsybakov [17]. Audibert
and Tsybakov [18] showed that there exist plug-in rules
converging with super-fast rates, that is, faster than $n^{-1}$,
under the margin assumption in Tsybakov [17]. In our case, which can
be viewed as a special case of plug-in rules, we take advantage
of Lemma 3.1 in [18].

Our main results can be summarized as follows. With either
labeled or unlabeled data, we show that the excess risk
converges and deduce the rate of this convergence. The
convergence rate depends on the local behavior of the function
$R(q')$ near $q$, which is determined by the behavior of $\eta(x)$ in
the vicinity of $\eta(x) = 1/2$. In general, $R$ is locally Lipschitz
at $q$, and the convergence rate is proportional to $n^{-1/2}$. If
$R$ is smoother/flatter at $q$, then the convergence rate can be much
faster, taking the form $n^{-(1+\alpha)/2}$, where $\alpha > 0$ is a
parameter reflecting the smoothness of $R$ at $q$. The value
$\alpha = 1$ is typical, and we actually have an $n^{-1}$ convergence
rate under mild conditions. The limit $\alpha \to \infty$ corresponds
to a situation where the risk function is flat near the true
probabilities, and thus insensitive to small errors in the estimate
of prior probabilities, in which case the error of the detector
based on the MLE converges to the Bayes error exponentially fast
with $n$. We also show that the convergence rates are
minimax-optimal given labeled data. Fig. 1 depicts three cases
illustrating the smoothness conditions and corresponding $\eta(x)$
considered in the paper.
(a) difficult case (b) moderate case (c) best case

Fig. 1. Examples of $R(q)$ and corresponding $\eta(x)$ leading to
different convergence rates

The paper is organized as follows. In Section II and III, we 
discuss the minimax lower bounds and upper bounds achieved 
by MLE with labeled data. Section IV discusses convergence 
rates when we only have unlabeled training data. Section V 
compares our results with those in standard passive learning 
and makes final remarks on our work. 

II. Convergence Rates in General Case with 
Labeled Data 

This section discusses the convergence rates of the proposed
detector trained with labeled data, without any further assumptions.
Let $\hat{q}$ be the MLE of $q$, i.e.,

$$\hat{q} := \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{Y_i = 1\}},$$

and define

$$\mathcal{P} := \{(p_1, p_0, q)\},$$

where $p_1, p_0$ are class-conditional densities and $q$ is the prior
probability.

We first set up a minimax lower bound.

A. Minimax Lower Bound 

Theorem 1. There exists a constant $c > 0$ such that

$$\inf_{\hat{q}} \sup_{\mathcal{P}} E[R(\hat{q})] - R(q) \geq c\,n^{-1/2},$$

where $\sup_{\mathcal{P}}$ takes the supremum over all possible
triples $(p_1, p_0, q)$ and $\inf_{\hat{q}}$ denotes the infimum over
all possible estimators of $q$ derived from $n$ samples of training
data with the prior knowledge of class-conditional densities.

Theorem 1 can be viewed as a corollary of Theorem 3
(given in the following section) if we take $\alpha = 0$ and remove
constraints on $p_1(x)$, $p_0(x)$ and $q$ in Theorem 3.

B. Upper Bound 

Theorem 2. If $\hat{q}$ is the MLE of $q$, we have

$$\sup_{\mathcal{P}} E[R(\hat{q})] - R(q) \leq \tfrac{1}{2}\,n^{-1/2}.$$

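Proof: Define the parametrized risk function as

$$R(q_1; q_2) := q_2\,P_1(q_1) + (1 - q_2)\,P_0(q_1).$$

Following the proof showing $q = \arg\min_{q'} R(q')$, we know

$$\hat{q} = \arg\min_{q'} R(q'; \hat{q}).$$

We express the excess risk as

$$E[R(\hat{q}) - R(q)] = E[R(\hat{q}; q) - R(q; q)] \leq E[R(\hat{q}; q) - R(\hat{q}; \hat{q})],$$

where the inequality uses $R(\hat{q}; \hat{q}) \leq R(q; \hat{q})$
together with $E[R(q; \hat{q})] = R(q; q)$, which holds because
$R(q; \cdot)$ is linear in its second argument and $E[\hat{q}] = q$.
If we write $R(\hat{q}; q) - R(\hat{q}; \hat{q})$ explicitly as

$$R(\hat{q}; q) - R(\hat{q}; \hat{q}) = (q - \hat{q})\big(P_1(\hat{q}) - P_0(\hat{q})\big),$$

then, since $|P_1(\hat{q}) - P_0(\hat{q})| \leq 1$, we have

$$E[R(\hat{q}) - R(q)] \leq E\big[(q - \hat{q})\big(P_1(\hat{q}) - P_0(\hat{q})\big)\big] \leq E|\hat{q} - q| \leq \sqrt{E(\hat{q} - q)^2} = \sqrt{\frac{q(1-q)}{n}} \leq \tfrac{1}{2}\,n^{-1/2},$$

which completes the proof of Theorem 2. ■

Remark 1. General results in this section also apply when
$p_i(x)$, $i = 0, 1$, are probability mass functions (pmf). In this
case, we can write $p_i(x)$, $i = 0, 1$, as a sum of weighted Dirac
delta functions, and then all of the arguments above hold.

III. Faster Convergence Rates with Labeled Data

In Section III and Section IV, without loss of generality,
we assume the true prior probability $q$ lies in the closed interval
$[\theta, 1 - \theta]$, where $\theta$ is an arbitrarily small positive
real number. The reason why we need this assumption is explained in
Section III-A.

Define the trimmed MLE of $q$ as

$$\hat{q} := \arg\max_{q \in [\theta, 1-\theta]} q^{\sum_{i=1}^{n} Y_i}\,(1 - q)^{\sum_{i=1}^{n} (1 - Y_i)}$$

and construct the regression function estimator
$\hat{\eta}_n(x; \hat{q})$ as

$$\hat{\eta}_n(x; \hat{q}) := \frac{\hat{q}\,p_1(x)}{(1 - \hat{q})\,p_0(x) + \hat{q}\,p_1(x)}.$$

The accuracy of $\hat{\eta}_n(x; \hat{q})$ is closely related to that
of estimating $q$ from $n$ training data. We set up a lemma to
describe the Lipschitz property of $\hat{\eta}_n(x; q)$ as a function
of $q$.

Lemma 1. The regression function estimator $\hat{\eta}_n(x; q)$
satisfies the Lipschitz property as a function of $q$:

$$\forall q_1, q_2 \in [\theta, 1-\theta], \quad \sup_{x} |\hat{\eta}_n(x; q_1) - \hat{\eta}_n(x; q_2)| \leq L\,|q_1 - q_2|,$$

where $L = 1/(4\theta(1 - \theta))$.

Proof: Denote $f(t, x) = t\,p_1(x)/(t\,p_1(x) + (1 - t)\,p_0(x))$.
We are interested in the partial derivative of $f$ with respect to
$t$:

$$\frac{\partial f}{\partial t} = \frac{p_0 p_1}{(t\,p_1(x) + (1 - t)\,p_0(x))^2} \geq 0.$$

Since $t \in [\theta, 1 - \theta]$, we have

$$\frac{p_0 p_1}{(t\,p_1(x) + (1 - t)\,p_0(x))^2} \leq \frac{p_0 p_1}{\big(2\sqrt{t(1-t)\,p_1 p_0}\big)^2} = \frac{1}{4t(1-t)} \leq \frac{1}{4\theta(1 - \theta)},$$

thus

$$\forall q_1, q_2 \in [\theta, 1-\theta], \quad \sup_{x} |\hat{\eta}_n(x; q_1) - \hat{\eta}_n(x; q_2)| \leq L\,|q_1 - q_2|,$$

where $L = 1/(4\theta(1 - \theta)) \geq 1$. ■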

Remark 2. On the decision boundary, we have

$$q\,p_1(x) = (1 - q)\,p_0(x),$$

which makes the inequality shown in the proof of Lemma 1
hold with equality; thus the Lipschitz constant $L$ cannot
be further improved.
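The Lipschitz bound of Lemma 1 and the tightness claim of Remark 2 can be checked numerically. The sketch below assumes the hypothetical pair $p_0 = N(0,1)$, $p_1 = N(\mu,1)$ with $\mu = 2$ and $\theta = 0.1$ (illustrative values, not from the paper); `eta_hat` is an illustrative name. It scans a grid of $x$ and pairs $(q_1, q_2)$ for the largest difference quotient, then evaluates the slope at the point where $\theta\,p_1(x) = (1-\theta)\,p_0(x)$.

```python
import math

def eta_hat(x, q, mu=2.0):
    """Plug-in regression function q*p1 / (q*p1 + (1-q)*p0) for the assumed
    pair p0 = N(0,1), p1 = N(mu,1), written via the likelihood ratio."""
    lam = math.exp(mu * x - mu * mu / 2.0)   # p1(x) / p0(x)
    return q * lam / (q * lam + (1.0 - q))

theta = 0.1
L = 1.0 / (4.0 * theta * (1.0 - theta))

qs = [theta + i * (1 - 2 * theta) / 40 for i in range(41)]
xs = [-6 + i * 0.05 for i in range(241)]

# Largest observed difference quotient over the grid; Lemma 1 says it is at most L.
worst = 0.0
for x in xs:
    for a in qs:
        for b in qs:
            if a != b:
                quotient = abs(eta_hat(x, a) - eta_hat(x, b)) / abs(a - b)
                worst = max(worst, quotient)

# Tightness: at q = theta and the x where theta*p1(x) = (1-theta)*p0(x),
# the slope in q approaches L.
mu = 2.0
x_star = mu / 2.0 + math.log((1.0 - theta) / theta) / mu
h = 1e-6
slope = (eta_hat(x_star, theta + h) - eta_hat(x_star, theta)) / h
```

On this grid `worst` stays below `L`, while `slope` is close to `L`, which is consistent with the statement that the constant cannot be improved.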

A. Polynomial Rates 

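Tsybakov [17] introduced a parametrized margin assumption
denoted as Assumption (MA):

There exist constants $C_0 > 0$, $c > 0$, and $\alpha \geq 0$, such
that when $\alpha < \infty$, we have

$$P_X\big(0 < |\eta(X) - \tfrac{1}{2}| \leq t\big) \leq C_0\,t^{\alpha} \quad \forall t > 0,$$

and when $\alpha = \infty$, we have

$$P_X\big(0 < |\eta(X) - \tfrac{1}{2}| \leq c\big) = 0.$$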



SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY 



Denote

$$\mathcal{P}_{\theta,\alpha} := \{(p_1, p_0, q) : \text{Assumption (MA) satisfied with parameter } \alpha \text{ and } q \in [\theta, 1 - \theta]\}.$$

The case $\alpha = 0$ is trivial (no margin assumption) and it is the
case explored in Section II. If $d = 1$ and the decision boundary
reduces, for example, to one point $x_0$, Assumption (MA) may
be interpreted as

$$|\eta(x) - \tfrac{1}{2}| \approx |x - x_0|^{1/\alpha}$$

for $x$ close to $x_0$. This interpretation sheds light on the fact
that $\alpha = 1$ is typical. If $\eta(x)$ is differentiable with
non-zero first-order derivative at $x = x_0$, then the first-order
approximation of $\eta(x)$ in the neighbourhood exists, which
means $\alpha = 1$ in this case. When $\eta(x)$ is smoother, for
example, if the first-order derivative vanishes at $x = x_0$ but the
second-order derivative does not, then we have $\alpha = 1/2$. When
$\eta(x)$ is not differentiable at $x = x_0$, then we may have
$\alpha > 1$; for example, when $\alpha = 2$, the derivative of
$\eta(x)$ at $x = x_0$ goes to infinity. Distributions $\pi_{XY}$
satisfying Assumption (MA) with larger $\alpha$ all have more drastic
changes near the boundary $\eta(x) = 1/2$, which makes $R(q)$ less
sensitive to small errors, leading to faster rates. The $R(q)$ and
corresponding $\eta(x)$ with typical $\alpha = 1$ in Assumption (MA)
are shown in Fig. 1(b).
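The typical case $\alpha = 1$ can be illustrated with a smooth pair of densities. The sketch below assumes the hypothetical pair $p_0 = N(0,1)$, $p_1 = N(\mu,1)$ with $\mu = 2$ and $q = 0.3$ (illustrative values only); `margin_prob` is an illustrative name. For this pair the margin event is an interval in $x$, so its probability is available in closed form, and the ratio of the probability to $t$ is roughly constant for small $t$, as Assumption (MA) with $\alpha = 1$ would suggest.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit(p):
    return math.log(p / (1.0 - p))

def margin_prob(t, q=0.3, mu=2.0):
    """P_X(0 < |eta(X) - 1/2| <= t) for the assumed Gaussian pair.
    Here logit(eta(x)) = mu*x - mu^2/2 + logit(q), so the event is an
    interval [a, b] in x (the boundary point itself has probability zero)."""
    a = (logit(0.5 - t) + mu * mu / 2.0 - logit(q)) / mu
    b = (logit(0.5 + t) + mu * mu / 2.0 - logit(q)) / mu
    mass_h1 = phi(b - mu) - phi(a - mu)   # mass of [a, b] under p1
    mass_h0 = phi(b) - phi(a)             # mass of [a, b] under p0
    return q * mass_h1 + (1.0 - q) * mass_h0

# If the margin probability scales like C0 * t, these ratios are nearly equal.
ratios = [margin_prob(t) / t for t in (0.005, 0.01, 0.02, 0.04)]
```

The near-constant values in `ratios` play the role of $C_0$ for this particular example.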

We explain why we need to bound the domain of $q$ by showing
what determines $C_0$ in Assumption (MA).

Consider the typical case when $\alpha = 1$, $d = 1$. Calculating the
derivative of $\eta(x)$ with respect to $x$ at the point $x = x_0$
that the decision boundary reduces to, we have

$$\eta'(x_0) = q(1 - q)\,\frac{p_1'(x_0)\,p_0(x_0) - p_0'(x_0)\,p_1(x_0)}{\big(q\,p_1(x_0) + (1 - q)\,p_0(x_0)\big)^2}.$$

Without loss of generality, suppose the marginal distribution
of $X$ is uniform. As the first-order approximation of $\eta(x)$ is

$$\Delta\eta(x) \approx \eta'(x_0)\,\Delta x,$$

we know

$$P_X\big(0 < |\eta(X) - \tfrac{1}{2}| \leq t\big) \propto \frac{1}{\eta'(x_0)}\,t \propto \frac{1}{q(1 - q)}\,t.$$

Then we can see that if $q$ goes to zero or one, the constant $C_0$
will approach infinity, which illustrates why we assume
$q \in [\theta, 1 - \theta]$, $\theta > 0$, at the beginning of
Section III. Assumption (MA) provides a useful characterization of
the behavior of the regression function $\eta(x)$ in the vicinity of
the level $\eta(x) = 1/2$, which turns out to be crucial in
determining convergence rates.

First we state a minimax lower bound under Assumption
(MA) as follows:

Theorem 3. There exists a constant $c > 0$ such that

$$\inf_{\hat{q}} \sup_{\mathcal{P}_{\theta,\alpha}} E[R(\hat{q})] - R(q) \geq c\,n^{-(1+\alpha)/2}.$$

The proof is given in Appendix A. It follows the general
minimax analysis strategy but is a non-trivial result.

Next we show that $n^{-(1+\alpha)/2}$ is also an upper bound. We
introduce Lemma 3.1 in Audibert and Tsybakov [18], which is rephrased
as follows:

Lemma 2. Let $\hat{\eta}_n$ be an estimator of the regression function
$\eta$ and $\mathcal{P}$ a set of $\pi_{XY}$ satisfying Assumption
(MA). Suppose that, for some constants $C_1 > 0$, $C_2 > 0$, for some
positive sequence $a_n$, for $n \geq 1$, any $\delta > 0$, and for
almost all $x$ w.r.t. $P_X$, we have

$$\sup_{P \in \mathcal{P}} P\big(|\hat{\eta}_n(x) - \eta(x)| \geq \delta\big) \leq C_1\,e^{-C_2 a_n \delta^2}.$$

Then the plug-in detector $\hat{f}_n = \mathbf{1}_{\{\hat{\eta}_n \geq 1/2\}}$
satisfies the following inequality:

$$\sup_{P \in \mathcal{P}} E[R(\hat{f}_n)] - R(f^*) \leq C\,a_n^{-(1+\alpha)/2}$$

for $n \geq 1$ with some constant $C > 0$ depending only on
$\alpha$, $C_0$, $C_1$ and $C_2$, where $f^*$ denotes the Bayes
detector.

Remark 3. Following the proof of Lemma 2, we know that $C$
increases as $C_1$ increases, as the constant $C_0$ in Assumption
(MA) increases, and as the constant $C_2$ decreases.

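Theorem 4. If $\hat{q}$ is the trimmed MLE of $q$, there exists a
constant $C > 0$ such that

$$\sup_{\mathcal{P}_{\theta,\alpha}} E[R(\hat{q})] - R(q) \leq C\,n^{-(1+\alpha)/2}.$$

Proof: According to Lemma 1, we have

$$\sup_{x \in \mathbb{R}^d,\ \hat{q}, q \in [\theta, 1-\theta]} |\hat{\eta}_n(x; \hat{q}) - \hat{\eta}_n(x; q)| \leq L\,|\hat{q} - q|.$$

Combining with Hoeffding's inequality, we have

$$\sup_{\mathcal{P}_{\theta,\alpha}} P\big(|\hat{\eta}_n(x) - \eta(x)| \geq \delta\big) \leq \sup_{\mathcal{P}_{\theta,\alpha}} P\big(|\hat{q} - q| \geq \delta/L\big) \leq 2e^{-2n\delta^2/L^2},$$

where $L > 0$ is the constant in Lemma 1. The inequality above
shows we can take $C_1 = 2$, $C_2 = 2/L^2$, $a_n = n$ in Lemma 2.
According to Lemma 2, we know

$$\sup_{\mathcal{P}_{\theta,\alpha}} E[R(\hat{q})] - R(q) \leq C\,n^{-(1+\alpha)/2}.$$

■

Remark 4. Consider the typical case when $\alpha = 1$. The
optimal rate here is $n^{-1}$, which is faster than both the naive
worst-case rate $n^{-1/2}$ shown in Section II and the optimal rate
in standard passive learning, $n^{-2/(2+\rho)}$, $\rho > 0$, shown in
Audibert and Tsybakov [18].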
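The contrast between the $n^{-1/2}$ worst-case bound and the $n^{-1}$ rate for $\alpha = 1$ can be seen in a small exact computation. The sketch below assumes the hypothetical pair $p_0 = N(0,1)$, $p_1 = N(\mu,1)$ with $\mu = 2$, $q = 0.3$ and $\theta = 0.05$ (illustrative values, not from the paper). With labeled data the count of labels $Y_i = 1$ is binomial, so the expected excess risk of the trimmed MLE is a finite sum and no simulation is needed.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def risk(q_used, q_true, mu=2.0):
    """R(q') for the assumed pair p0 = N(0,1), p1 = N(mu,1), 0 < q' < 1."""
    x_thr = mu / 2.0 + math.log((1.0 - q_used) / q_used) / mu
    return q_true * phi(x_thr - mu) + (1.0 - q_true) * (1.0 - phi(x_thr))

def binom_pmf(k, n, q):
    """Binomial(n, q) pmf, computed in log space to avoid overflow."""
    log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(q) + (n - k) * math.log(1.0 - q))
    return math.exp(log_p)

def expected_excess_risk(n, q, theta=0.05):
    """E[R(q_hat)] - R(q), where q_hat is the fraction of labels equal to 1,
    clipped to [theta, 1 - theta] (the trimmed MLE)."""
    base = risk(q, q)
    total = 0.0
    for k in range(n + 1):
        q_hat = min(max(k / n, theta), 1.0 - theta)
        total += binom_pmf(k, n, q) * (risk(q_hat, q) - base)
    return total

q = 0.3
excess = {n: expected_excess_risk(n, q) for n in (100, 400, 1600)}
scaled = {n: n * e for n, e in excess.items()}   # roughly constant under a 1/n rate
```

In this example each value in `excess` sits well below the Theorem 2 bound $\tfrac{1}{2} n^{-1/2}$, and the values in `scaled` change little with $n$, which is what an $n^{-1}$ rate would look like.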

Remark 5. Consider the case when the true prior probability
$q$ lies near zero or one. This makes the constant $C_0$ in
Assumption (MA) go to infinity, as shown in the introduction
of Assumption (MA), and the constant $C_2$ go to zero, as shown in
the proof of Theorem 4, which slows down the convergence of the
excess risk.



JIAO et al.: MINIMAX-OPTIMAL BOUNDS FOR DETECTORS BASED ON ESTIMATED PRIOR PROBABILITIES 



B. Exponential Rates 

We investigate the convergence rates when $\alpha = \infty$ in
Assumption (MA). Intuitively, as $\alpha$ grows, the rates
become faster than any polynomial rate of fixed degree, as
is shown in Theorem 5.

Theorem 5. If $\hat{q}$ is the trimmed MLE defined above, under
Assumption (MA) with $\alpha = \infty$, we have

$$\sup_{\mathcal{P}_{\theta,\alpha}} E[R(\hat{q})] - R(q) \leq 2e^{-2nc^2/L^2},$$

where $c$ is the positive constant in Assumption (MA) and $L$ is the
constant in Lemma 1.

Proof: According to Lemma 1, we know that as long as
$|\hat{q} - q| \leq c/L$, $\hat{q} \in [\theta, 1 - \theta]$, the error
of the regression function estimator is bounded uniformly by $c$,
incurring no error in detection according to Assumption (MA) when
$\alpha = \infty$. The mathematical representation is:
$R(\hat{q}) = R(q)$, $\forall \hat{q} \in [q - c/L, q + c/L] \cap [\theta, 1 - \theta]$.
Then we write the excess risk as follows:

$$E[R(\hat{q})] - R(q) = \Big(\int_{|\hat{q} - q| > \delta} + \int_{|\hat{q} - q| \leq \delta}\Big)\,[R(\hat{q}) - R(q)]\,dP,$$

where $P$ is the probability measure on the sample space $\Omega$ of
$\hat{q}$. Taking $\delta = c/L$, the second term vanishes.
Applying Chernoff's bound, the first term is bounded by
$2e^{-2n\delta^2}$, so we conclude

$$\sup_{\mathcal{P}_{\theta,\alpha}} E[R(\hat{q})] - R(q) \leq 2e^{-2nc^2/L^2}.$$

■

Remark 6. When $p_i(x)$, $i = 0, 1$, are probability mass
functions, if $X$ takes values in $\mathcal{X}$ with
$\#\{\mathcal{X}\} < \infty$, then

$$\inf_{x_i \in \mathcal{X}:\ \eta(x_i) \neq 1/2} |\eta(x_i) - 1/2| \geq c > 0,$$

which means there exists a constant $c > 0$ such that

$$P\big(0 < |\eta(X) - 1/2| \leq c\big) = 0.$$

Based on the discussion above, an exponential convergence rate
is always guaranteed when $x$ lies in a finite discrete domain. If
$\#\{\mathcal{X}\}$ is infinite, then we may have finite
$\alpha > 0$ with optimal convergence rates $n^{-(1+\alpha)/2}$.
However, finite $\#\{\mathcal{X}\}$ is the case that often arises in
practice.
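The finite-alphabet situation of Remark 6 lends itself to an exact check of the Theorem 5 bound. The sketch below uses a hypothetical three-point alphabet with made-up pmfs `P0` and `P1`, $q = 0.4$ and $\theta = 0.1$ (none of these values come from the paper). For these numbers $\eta$ takes the values $4/19$, $0.4$ and $0.625$, so any $c < 0.1$ satisfies Assumption (MA) with $\alpha = \infty$.

```python
import math

P0 = [0.5, 0.3, 0.2]   # hypothetical pmf under H0
P1 = [0.2, 0.3, 0.5]   # hypothetical pmf under H1

def risk(q_used, q_true):
    """Error probability of the plug-in detector that decides H1 at x
    whenever q_used * p1(x) >= (1 - q_used) * p0(x)."""
    err = 0.0
    for p0, p1 in zip(P0, P1):
        if q_used * p1 >= (1.0 - q_used) * p0:
            err += (1.0 - q_true) * p0      # false-alarm mass at this x
        else:
            err += q_true * p1              # miss mass at this x
    return err

def binom_pmf(k, n, q):
    """Binomial(n, q) pmf, computed in log space to avoid overflow."""
    log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(q) + (n - k) * math.log(1.0 - q))
    return math.exp(log_p)

def expected_excess_risk(n, q, theta):
    """Exact E[R(q_hat)] - R(q) for the trimmed MLE from n labeled samples."""
    base = risk(q, q)
    total = 0.0
    for k in range(n + 1):
        q_hat = min(max(k / n, theta), 1.0 - theta)
        total += binom_pmf(k, n, q) * (risk(q_hat, q) - base)
    return total

q, theta = 0.4, 0.1
c = 0.09                                   # any value below min |eta - 1/2| = 0.1
L = 1.0 / (4.0 * theta * (1.0 - theta))    # Lipschitz constant from Lemma 1
table = [
    (n, expected_excess_risk(n, q, theta), 2.0 * math.exp(-2.0 * n * c * c / L ** 2))
    for n in (50, 200, 800, 3200)
]
```

Each row of `table` holds $n$, the exact excess risk, and the bound $2e^{-2nc^2/L^2}$. In this example the bound is loose for small $n$, since it exceeds one, but the exact excess risk drops off quickly as $n$ grows.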

IV. Convergence Rates with Unlabeled Data 

In this section, we discuss convergence rates when we only
have unlabeled training data. Unlabeled data is usually easier to
obtain in practice than labeled data, so the convergence rate
analysis in this case deserves attention. It also helps reveal
how much information is stored in $\{X_i\}_{i=1}^n$ in the training
data pairs $\{(X_i, Y_i)\}_{i=1}^n$.

In this case, we are faced with a classical parameter
estimation problem. Given

$$X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} q\,p_1(x) + (1 - q)\,p_0(x),$$

we want to construct an estimator $\hat{q}$ to estimate $q$ as
efficiently as possible. Here we use the MLE and derive upper bounds
under Assumption (MA).

Before starting the proof, we introduce a standard quantity 
measuring distances between probability measures. 

Definition 1. The total variation distance between two
probability density functions $p$, $q$ is defined as follows:

$$V(p, q) = \sup_{A} \Big|\int_A (p - q)\,d\nu\Big| = 1 - \int \min(p, q)\,d\nu,$$

where $\nu$ denotes the Lebesgue measure on the signal space
$\mathbb{R}^d$ and $A$ is any measurable subset of the domain.
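The two expressions in Definition 1 can be compared numerically. The sketch below assumes two unit-variance Gaussians $N(0,1)$ and $N(\mu,1)$ with $\mu = 2$ (an illustrative choice); `tv_distance` is an illustrative helper that integrates $\min(p,q)$ with the midpoint rule. For this pair the supremum over $A$ is attained on $A = \{x > \mu/2\}$, which gives the closed form $2\Phi(\mu/2) - 1$.

```python
import math

def gauss_pdf(x, mean):
    """Density of N(mean, 1)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def tv_distance(p, q, lo=-12.0, hi=14.0, steps=200_000):
    """V(p, q) = 1 - integral of min(p, q), by the midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    overlap = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        overlap += min(p(x), q(x))
    return 1.0 - overlap * h

mu = 2.0
v_numeric = tv_distance(lambda x: gauss_pdf(x, 0.0), lambda x: gauss_pdf(x, mu))

# Supremum form evaluated on A = {x > mu/2}:
# Phi(mu/2) - Phi(-mu/2) = 2*Phi(mu/2) - 1 = erf(mu / (2*sqrt(2))).
v_closed = math.erf(mu / (2.0 * math.sqrt(2.0)))
```

The two values agree up to the discretization error of the integration grid.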

We will quantify our results in terms of the total variation
distance. Here we assume

$$V(p_1, p_0) \geq V_{\min} > 0,$$

ensuring that the two class-conditional densities are not too close
to indiscernible, so that it is possible to learn the prior
probability $q$ from unlabeled data. For details about how this
assumption works, please see Appendix B.
Define a class of triples:

$$\mathcal{P}_{\theta,\alpha,V_{\min}} := \{(p_1, p_0, q) : \text{Assumption (MA) satisfied with parameter } \alpha,\ q \in [\theta, 1 - \theta] \text{ and } V(p_1, p_0) \geq V_{\min} > 0\},$$

and define the trimmed MLE $\hat{q}$ in this case as

$$\hat{q} := \arg\max_{q \in [\theta, 1-\theta]} \sum_{i=1}^{n} \log\big(q\,p_1(X_i) + (1 - q)\,p_0(X_i)\big).$$

We set up an upper bound for the performance of the trimmed
MLE $\hat{q}$:

Theorem 6. If $\hat{q}$ is the trimmed MLE defined above, there
exists a constant $C > 0$ such that

$$\sup_{\mathcal{P}_{\theta,\alpha,V_{\min}}} E[R(\hat{q})] - R(q) \leq C\,n^{-(1+\alpha)/2}.$$

The proof of Theorem 6 is given in Appendix B.

Remark 7. We can show the calculation of MLE is a convex 
optimization problem, for which we have efficient methods. 
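One way to exploit the structure mentioned in Remark 7 is sketched below; it is an illustration under stated assumptions, not the authors' procedure. The log-likelihood $\sum_i \log(q\,p_1(X_i) + (1-q)\,p_0(X_i))$ is concave in $q$, so its derivative is non-increasing and a sign-based bisection on $[\theta, 1-\theta]$ locates the trimmed MLE. The Gaussian densities, $\mu = 2$, $\theta = 0.05$, the sample size and the seed are all hypothetical choices, and `mixture_mle` is an illustrative name.

```python
import math
import random

def gauss_pdf(x, mean):
    """Density of N(mean, 1)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def mixture_mle(xs, p0, p1, theta=0.05, iters=60):
    """Trimmed MLE of q from unlabeled samples xs, by bisection on the
    derivative of the concave log-likelihood over [theta, 1 - theta]."""
    pairs = [(p0(x), p1(x)) for x in xs]

    def score(q):
        # d/dq of sum_i log(q*p1(x_i) + (1-q)*p0(x_i))
        return sum((b - a) / (q * b + (1.0 - q) * a) for a, b in pairs)

    lo, hi = theta, 1.0 - theta
    if score(lo) <= 0.0:       # likelihood already decreasing at the left end
        return lo
    if score(hi) >= 0.0:       # likelihood still increasing at the right end
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

random.seed(0)
q_true, mu, n = 0.3, 2.0, 5000
# Draw from the mixture q*p1 + (1-q)*p0 and discard the labels.
xs = [random.gauss(mu if random.random() < q_true else 0.0, 1.0) for _ in range(n)]
q_hat = mixture_mle(xs, lambda x: gauss_pdf(x, 0.0), lambda x: gauss_pdf(x, mu))
```

Bisection is used here only because the problem is one-dimensional; any standard concave maximization routine would serve the same purpose.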

Remark 8. Compared to learning detectors based on labeled
data, we sacrifice a constant factor in the convergence rate
when given unlabeled data. Given the true prior probability
$q$, when $V(p_1, p_0)$ is smaller, the constant $C_2$ in Lemma 2
becomes smaller as well, which slows down the
convergence of the excess risk. This phenomenon is discussed in
the proof of Theorem 6.

V. Final Remarks 

This paper presents a convergence rate analysis for detectors
constructed using known class-conditional densities and prior
probabilities estimated by the MLE. All of the bounds
are dimension-free. The bounds are minimax-optimal given
labeled data and achievable with either labeled or unlabeled
data. It remains an interesting open question to show that the
rate $n^{-(1+\alpha)/2}$ is minimax-optimal given unlabeled data under
Assumption (MA) and the extra assumption on $V(p_1, p_0)$, or
to establish the same upper bound on convergence rates for the
unlabeled case without the extra assumption on $V(p_1, p_0)$ in
Section IV. We show that the constant factors in the convergence
rates are mainly influenced by two elements:

1) The value of the true prior probability

2) Unlabeled data case: $V(p_1, p_0)$

We show that a prior probability near zero or one leads to
slower convergence with either labeled or unlabeled data;
in the unlabeled data case, a smaller $V(p_1, p_0)$ also leads to
slower convergence.

Our results are analogous to those of general classification
in statistical learning. Intuitively, learning the class-conditional
densities is the main challenge in standard passive learning, and
it is sensible to say that knowing the class-conditional
densities makes the problem relatively easy. The following
quantitative results support this. We take the fastest
rate previously shown for standard passive learning under
Assumption (MA), in Audibert and Tsybakov [18], and compare
it with our result in Table I:

TABLE I

Convergence Rates Comparison under Assumption (MA)

Passive Learning ($p_1, p_0$ unknown)     $n^{-(1+\alpha)/(2+\rho)}$

Passive Learning ($p_1, p_0$ known)       $n^{-(1+\alpha)/2}$

Here $\rho = d/\beta > 0$, where $\beta$ is the Hölder exponent of
$\eta(x)$. The rate $n^{-(1+\alpha)/(2+\rho)}$ is obtained with another
strong assumption that the marginal distribution of $X$ is bounded
from below and above, which isn't necessary here. Here we can see
that the factor $\rho$ reflects the price we have to pay for not
knowing the class-conditional densities, and it is directly related
to the complexity of non-parametrically learning the density
functions.
Appendix A
Proof of Theorem 3

The proof strategy follows the idea of standard minimax
analysis introduced in Tsybakov [19] and consists in reducing
the problem of classification to a hypothesis testing problem.
In this case, it suffices to consider two hypotheses. Here, we
have to pay extra attention to the design of the hypotheses because
we have access to the class-conditional densities, which puts an
extra constraint on the hypothesis design. We rephrase a bound from
Tsybakov [19]:

Lemma 3. Denote by $\mathcal{P}$ the class of joint distributions
represented by triples $(p_1, p_0, q)$ where $(p_1, p_0)$ are
class-conditional densities and $q$ is the prior probability.
Associated with each element $(p_1, p_0, q) \in \mathcal{P}$, we have
a probability measure $\pi_{XY}$ defined on
$\mathbb{R}^d \times \{0, 1\}$. Let
$d(\cdot, \cdot) : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$
be a semidistance. Let $(p_1, p_0, q_0), (p_1, p_0, q_1) \in \mathcal{P}$
be such that $d((p_1, p_0, q_0), (p_1, p_0, q_1)) \geq 2a$, with
$a > 0$. Assume also that
$KL(\pi_{XY}(p_1, p_0, q_1) \,\|\, \pi_{XY}(p_1, p_0, q_0)) \leq \gamma$,
where $KL$ denotes the Kullback-Leibler divergence. The following
bound holds:

$$\inf_{\hat{q}} \sup_{\mathcal{P}} P_{\pi_{XY}(p_1, p_0, q)}\big(d((p_1, p_0, \hat{q}), (p_1, p_0, q)) \geq a\big)$$
$$\geq \inf_{\hat{q}} \max_{j \in \{0,1\}} P_{\pi_{XY}(p_1, p_0, q_j)}\big(d((p_1, p_0, \hat{q}), (p_1, p_0, q_j)) \geq a\big)$$
$$\geq \max\Big(\frac{1}{4}\exp(-\gamma),\ \frac{1 - \sqrt{\gamma/2}}{2}\Big),$$

where the infimum is taken with respect to the collection
of all possible estimators of $q$ (based on a sample from
$\pi_{XY}(p_1, p_0, q)$ with known class-conditional densities).




Fig. 2. Two $\eta(x)$ used for the proof of Theorem 3 when $d = 1$

Denote $\hat{G}_n := \{x : \hat{\eta}_n(x; \hat{q}) \geq 1/2\}$, where
$\hat{\eta}_n(x; \hat{q})$ is defined in Section III, and the optimal
decision regions as $G_j^* := \{x : \eta_j(x) \geq 1/2\}$, where the
subscript $j$ indicates that the excess risk is being measured with
respect to the distribution $\pi_{XY}((p_1, p_0, q_j))$, $j = 0, 1$.
Take $\mathcal{P} = \mathcal{P}_{\theta,\alpha}$. We
are interested in controlling the excess risk

$$R_j(\hat{q}) - R_j(q_j).$$

To prove the lower bound we will use the following
class-conditional densities, which allow us to easily attain any
desired margin parameter $\alpha$ in Assumption (MA) by adjusting
the parameter $\kappa$ below:

$$p_1(x) = \begin{cases} \dfrac{(1 + 2c\,x_d^{\kappa-1})(1 - 2t^{\kappa-1})}{1 - 4c\,(t x_d)^{\kappa-1}}, & x \in [0,1]^{d-1} \times [0, t), \\ 1 + 2c_1 (x_d - t)^{\kappa-1}, & x \in [0,1]^{d-1} \times [t, 1], \\ 0, & x \in \mathbb{R}^d \setminus [0,1]^d, \end{cases}$$

$$p_0(x) = \begin{cases} 2 - p_1(x), & x \in [0,1]^d, \\ 0, & x \in \mathbb{R}^d \setminus [0,1]^d, \end{cases}$$

where $x = (x_1, \ldots, x_d)$, and $0 < c \ll 1$, $\kappa > 1$ are
constants. The quantity $0 < t \ll 1$ is a small real number which
goes to zero as $n \to \infty$, and will be determined later. It is
easy to verify that in order to make $\int p_i\,dx = 1$, $i = 0, 1$,
hold, as $t \to 0$, the number $c_1$ is of order $O(t^{\kappa})$,
which also goes to zero. Assigning prior probabilities to $H_1$ and
$H_0$ as

$$q_0 = \frac{1}{2}, \qquad q_1 = \frac{1}{2} + t^{\kappa-1},$$

the marginal distribution of $X$ under $q_0$, $P_X^{(0)}$, is uniform
on $[0,1]^d$, and $P_X^{(1)}$ is approximately uniform on $[0,1]^d$.
We can compute the regression functions based on the equation

$$\eta_j(x) = \frac{q_j\,p_1(x)}{q_j\,p_1(x) + (1 - q_j)\,p_0(x)},$$
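and have the explicit expressions of $\eta_j(x)$, $j \in \{0, 1\}$, as

$$\eta_0(x) = \begin{cases} \dfrac{(1/2 + c\,x_d^{\kappa-1})(1 - 2t^{\kappa-1})}{1 - 4c\,(t x_d)^{\kappa-1}}, & x \in [0,1]^{d-1} \times [0, t), \\ \dfrac{1}{2} + c_1 (x_d - t)^{\kappa-1}, & x \in [0,1]^{d-1} \times [t, 1], \\ 0, & x \in \mathbb{R}^d \setminus [0,1]^d, \end{cases}$$

$$\eta_1(x) = \begin{cases} \dfrac{1}{2} + c\,x_d^{\kappa-1}, & x \in [0,1]^{d-1} \times [0, t), \\ \dfrac{(1/2 + t^{\kappa-1})(1 + 2c_1 (x_d - t)^{\kappa-1})}{1 + 4t^{\kappa-1} c_1 (x_d - t)^{\kappa-1}}, & x \in [0,1]^{d-1} \times [t, 1], \\ 0, & x \in \mathbb{R}^d \setminus [0,1]^d. \end{cases}$$

From above we see $G_0^* = [0,1]^{d-1} \times [t, 1]$ and
$G_1^* = [0,1]^d$. Fig. 2 depicts $\eta_j(x)$, $j \in \{0, 1\}$, when
$d = 1$.

In order to further analyze the designed hypotheses, we show
that the parameter $\alpha$ in Assumption (MA) for $\eta_j(x)$,
$j = 0, 1$, is $\alpha = 1/(\kappa - 1)$. Consider the case $j = 0$
(the case $j = 1$ is analogous).

As $\eta_0((0, \ldots, 0, t)) = 1/2 - \frac{(1 - c)\,t^{\kappa-1}}{1 - 4c\,t^{2(\kappa-1)}} \leq 1/2 - (1 - c)\,t^{\kappa-1} = 1/2 - \tau^*$,
provided $\tau \leq \tau^*$, we have

$$P_0\big(0 < |\eta_0(X) - \tfrac{1}{2}| \leq \tau\big) = P_0\big(0 < (X_d - t)^{\kappa-1} \leq \tau/c_1\big) \leq \Big(\frac{\tau}{c_1}\Big)^{1/(\kappa-1)} = C_\eta\,\tau^{1/(\kappa-1)},$$

where $C_\eta = c_1^{-1/(\kappa-1)} > 1$. The second step follows since
$P_X^{(0)}$ is uniform on $[0,1]^d$.

Since the excess risk is not a semidistance, we cannot
apply Lemma 3 directly, but we can relate the excess risk to the
symmetric distance measure, and then use the lemma. First we
introduce Proposition 1 in Tsybakov [17], rephrased as follows:

Lemma 4. Assume that $P(0 < |\eta(X) - 1/2| \leq \tau) \leq C_\eta \tau^{\alpha}$
for some finite $C_\eta > 0$, $\alpha > 0$ and all $0 < \tau \leq \tau^*$,
where $\tau^* \leq 1/2$. Then there exist $c_\alpha > 0$,
$0 < \epsilon_0 \leq 1$ such that

$$R_j(\hat{q}) - R_j(q_j) \geq c_\alpha\, d_\Delta(\hat{G}_n, G_j^*)^{1 + 1/\alpha}$$

for all $\hat{G}_n$ such that $d_\Delta(\hat{G}_n, G_j^*) \leq \epsilon_0 \leq 1$,
where $c_\alpha = 2 C_\eta^{-1/\alpha}\,\alpha\,(\alpha + 1)^{-1-1/\alpha}$,
$\epsilon_0 = C_\eta(\alpha + 1)\,(\tau^*)^{\alpha}$, and
$d_\Delta(\hat{G}_n, G_j^*) := \int_{\hat{G}_n \Delta G_j^*} dx$ is the
symmetric distance measure.

When $j = 0$, plug in $\tau^* = (1 - c)\,t^{\kappa-1}$. Since $c$ is
very small, we know
$\epsilon_0 = C_\eta\,(1 + 1/(\kappa - 1))\,(1 - c)^{1/(\kappa-1)}\,t \geq t/2$.
Analogously we can show that when $j = 1$, $\epsilon_0 \geq t/2$ also
holds.

We now proceed by applying Lemma 3 to the semidistance
$d_\Delta$ and then use Lemma 4 to control the excess risk. Note
that $d_\Delta(G_0^*, G_1^*) = t$. Let
$P_{0,n} := P^{(0)}_{X_1,Y_1} \times \cdots \times P^{(0)}_{X_n,Y_n}$
be the probability measure of the random variables
$\{(X_i, Y_i)\}_{i=1}^n$ under hypothesis 0 and define
analogously $P_{1,n} := P^{(1)}_{X_1,Y_1} \times \cdots \times P^{(1)}_{X_n,Y_n}$.
Consider the KL-divergence $KL(P_{1,n} \| P_{0,n})$:

$$KL(P_{1,n} \| P_{0,n}) = E_1\Big[\log \prod_{i=1}^{n} \frac{p^{(1)}_{X,Y}(X_i, Y_i)}{p^{(0)}_{X,Y}(X_i, Y_i)}\Big] = n\,E_1\Big[\log \frac{p^{(1)}_{X,Y}(X, Y)}{p^{(0)}_{X,Y}(X, Y)}\Big],$$

where $E_1\big[\log \frac{p^{(1)}_{X,Y}(X, Y)}{p^{(0)}_{X,Y}(X, Y)}\big]$ can be
simplified as

$$\int q_1\,p_1(x) \log\frac{q_1\,p_1(x)}{q_0\,p_1(x)}\,dx + \int (1 - q_1)\,p_0(x) \log\frac{(1 - q_1)\,p_0(x)}{(1 - q_0)\,p_0(x)}\,dx = q_1 \log\frac{q_1}{q_0} + (1 - q_1)\log\frac{1 - q_1}{1 - q_0}.$$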

The expression in the last line is the KL-divergence between
two Bernoulli random variables. It can be easily verified that
the KL-divergence between two Bernoulli random variables is
bounded as in the following lemma:

Lemma 5. Let $P$ and $Q$ be Bernoulli random variables with
parameters, respectively, $1/2 - p$ and $1/2 - q$. Let
$|p|, |q| \leq 1/4$; then $KL(P \| Q) \leq 8(p - q)^2$.

Thus we know

$$KL(P_{1,n} \| P_{0,n}) \leq 8n\,(t^{\kappa-1})^2.$$
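Lemma 5 and the resulting bound can be sanity-checked numerically. The sketch below scans a grid of $(p, q)$ with $|p|, |q| \leq 1/4$ for the largest ratio $KL / (p - q)^2$, and then evaluates $n \cdot KL$ for the two priors $q_0 = 1/2$ and $q_1 = 1/2 + t^{\kappa-1}$ with $t = n^{-1/(2(\kappa-1))}$. The values $n = 10000$ and $\kappa = 2$ are hypothetical choices for illustration, and `kl_bernoulli` is an illustrative name.

```python
import math

def kl_bernoulli(a, b):
    """KL(Bernoulli(a) || Bernoulli(b)) for a, b strictly between 0 and 1."""
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))

# Grid over p, q in [-1/4, 1/4]; Lemma 5 says the ratio never exceeds 8.
grid = [i / 100 for i in range(-25, 26)]
worst_ratio = max(
    kl_bernoulli(0.5 - p, 0.5 - q) / (p - q) ** 2
    for p in grid for q in grid if p != q
)

# n i.i.d. labeled pairs: the KL divergence is n times the single-sample value.
n, kappa = 10_000, 2.0
t = n ** (-1.0 / (2.0 * (kappa - 1.0)))
total_kl = n * kl_bernoulli(0.5 + t ** (kappa - 1.0), 0.5)
```

With this choice of $t$ the quantity `total_kl` stays bounded as $n$ grows, which is what allows Lemma 3 to be applied with a constant $\gamma$.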

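Taking $t = n^{-1/(2(\kappa-1))}$,
$d((p_1, p_0, \hat{q}), (p_1, p_0, q_j)) := d_\Delta(\hat{G}_n, G_j^*)$
and using Lemma 3, we know that for $n$ large enough (implying
$t$ small),

$$\inf_{\hat{q}} \max_{j \in \{0,1\}} P_j\big(d((p_1, p_0, \hat{q}), (p_1, p_0, q_j)) \geq t/2\big) \geq \frac{1}{4}\exp(-8).$$

Notice that in Lemma 4, $\epsilon_0 \geq t/2$, so we can apply
Lemma 4 to show

$$\inf_{\hat{q}} \max_{j \in \{0,1\}} P_j\big(R_j(\hat{q}) - R_j(q_j) \geq c_\alpha (t/2)^{\kappa}\big) \geq \inf_{\hat{q}} \max_{j \in \{0,1\}} P_j\big(d((p_1, p_0, \hat{q}), (p_1, p_0, q_j)) \geq t/2\big) \geq \frac{1}{4}\exp(-8).$$

According to Markov's inequality, we conclude

$$\inf_{\hat{q}} \sup_{\mathcal{P}_{\theta,\alpha}} E[R(\hat{q}) - R(q)] \geq c'\,n^{-(1+\alpha)/2},$$

where $\alpha = 1/(\kappa - 1)$ and $c' = \frac{1}{4} e^{-8}\,c_\alpha\,(\frac{1}{2})^{\kappa}$.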

Appendix B
Proof of Theorem 6

We introduce two more quantities measuring distances
between probability distributions.

Definition 2. The Hellinger distance between two probability
density functions $p$, $q$ is defined as follows:

$$H(p, q) = \Big(\int (\sqrt{p} - \sqrt{q})^2\,d\nu\Big)^{1/2}.$$

Definition 3. The $\chi^2$ divergence between two probability
density functions $p$, $q$ is defined as follows:

$$\chi^2(p, q) = \int_{pq > 0} \frac{p^2}{q}\,d\nu - 1.$$

As is shown in Tsybakov [19], we have the following
inequalities:

$$V^2(p, q) \leq H^2(p, q) \leq \chi^2(p, q).$$

Define $f(x, q) = q\,p_1(x) + (1 - q)\,p_0(x)$. We use the Hellinger
distance to measure the error of estimating $q$ from training
data:

$$r_2(q, q + h) := H\big(f(x, q), f(x, q + h)\big).$$

We introduce a concentration inequality for the MLE, i.e.,
Theorem 1.5.3 in Ibragimov and Has'minskii [20], rephrased as
follows:

Lemma 6. Let $Q$ be a bounded interval in $\mathbb{R}$, and let
$f(x, q)$ be a continuous function of $q$ on $Q$ for $\nu$-almost all
$x$, where $\nu$ denotes the Lebesgue measure on $\mathbb{R}^d$. Let
the following conditions be satisfied:

1) There exists a number $\xi > 1$ such that

$$\sup_{q \in Q} \sup_{h} |h|^{-\xi}\,r_2^2(q, q + h) = A < \infty.$$

2) For any compact set $K$ there corresponds a positive
number $a(K) = a > 0$ such that

$$r_2^2(q, q + h) \geq \frac{a\,|h|^{\xi}}{1 + |h|^{\xi}}, \quad q \in K.$$

Then the maximum likelihood estimator $\hat{q}$ is defined, consistent,
and

$$\sup_{q \in K} P_q\big(|\hat{q} - q| > \epsilon\big) \leq B_0\,e^{-b_0 a n \epsilon^{\xi}},$$

where the positive constants $B_0$ and $b_0$ do not depend on $K$, and
$n$ is the number of training data.

Taking $Q = K = [\theta, 1 - \theta]$, it suffices to show that the two
assumptions in Lemma 6 hold with $\xi = 2$; then we can use
Lemma 2 to complete the proof.

Proof:

1. $\sup_{q \in Q} \sup_{h} |h|^{-2}\,r_2^2(q, q + h) = A < \infty$.
We have

$$r_2^2(q, q + h) \leq \chi^2\big(f(x, q + h), f(x, q)\big)$$
$$= \int \Big[f(x, q) + \frac{h^2 (p_1 - p_0)^2}{f(x, q)} + 2h(p_1 - p_0)\Big]\,d\nu - 1$$
$$= h^2 \int \frac{(p_1 - p_0)^2}{q\,p_1 + (1 - q)\,p_0}\,d\nu$$
$$= h^2 \Big[\int_{p_1 > p_0} \frac{(p_1 - p_0)^2}{q(p_1 - p_0) + p_0}\,d\nu + \int_{p_1 < p_0} \frac{(p_0 - p_1)^2}{(1 - q)(p_0 - p_1) + p_1}\,d\nu\Big]$$
$$\leq h^2 \Big[\int_{p_1 > p_0} \frac{(p_1 - p_0)^2}{q(p_1 - p_0)}\,d\nu + \int_{p_1 < p_0} \frac{(p_0 - p_1)^2}{(1 - q)(p_0 - p_1)}\,d\nu\Big]$$
$$= h^2 \Big[\frac{1}{q} V(p_1, p_0) + \frac{1}{1 - q} V(p_1, p_0)\Big] = \frac{h^2}{q(1 - q)}\,V(p_1, p_0) \leq \frac{h^2}{\theta(1 - \theta)}.$$

Thus, we verified the first assumption in Lemma 6 by
asserting

$$\sup_{q \in Q} \sup_{h} |h|^{-2}\,r_2^2(q, q + h) \leq \frac{1}{\theta(1 - \theta)} < \infty.$$

2. $r_2^2(q, q + h) \geq a|h|^2/(1 + |h|^2)$, $q \in K$. Since

$$r_2^2(q, q + h) \geq V^2\big(f(x, q), f(x, q + h)\big) = h^2\,V^2(p_1, p_0) \geq \frac{V^2(p_1, p_0)\,h^2}{1 + h^2} \geq \frac{V_{\min}^2\,h^2}{1 + h^2},$$

we can take $a = V_{\min}^2$. Then we can show

$$\sup_{q \in [\theta, 1-\theta]} P_q\big(|\hat{q} - q| > \epsilon\big) \leq B_0\,e^{-b_0 a n \epsilon^2}.$$

Applying Lemma 2 by taking $C_1 = B_0$, $C_2 = b_0 a$, $a_n = n$,
we complete the proof of Theorem 6. ■

Remark 9. In the proof of Theorem 6 we have

$$\Big(\int \frac{(p_1 - p_0)^2}{q\,p_1 + (1 - q)\,p_0}\,d\nu\Big)^{-1} \geq \frac{q(1 - q)}{V(p_1, p_0)},$$

where the left term is the reciprocal of the Fisher information
given unlabeled data, and the right term is the reciprocal of the
Fisher information given labeled data divided by $V(p_1, p_0)$. This
inequality holds with equality when $p_1$ and $p_0$ do not overlap at
all. Since the minimum variance of an unbiased estimator is described
by the reciprocal of the Fisher information, this inequality shows
that the convergence of $\hat{q}$ to $q$ in the unlabeled case can
never be faster than that in the labeled case, and will be slower if
$V(p_1, p_0)$ is small.
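The Fisher information comparison in Remark 9 can be checked on a concrete pair of densities. The sketch below assumes the hypothetical pair $p_0 = N(0,1)$, $p_1 = N(\mu,1)$ with $\mu = 2$ and $q = 0.3$ (illustrative values only), and uses a simple midpoint rule; `integrate` is an illustrative helper. It evaluates the unlabeled Fisher information $\int (p_1 - p_0)^2 / f\,d\nu$, the labeled Fisher information $1/(q(1-q))$, and the total variation distance $V(p_1, p_0)$.

```python
import math

def gauss_pdf(x, mean):
    """Density of N(mean, 1)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def integrate(g, lo=-12.0, hi=14.0, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))

q, mu = 0.3, 2.0

def p0(x):
    return gauss_pdf(x, 0.0)

def p1(x):
    return gauss_pdf(x, mu)

# Fisher information about q in one unlabeled sample X ~ q*p1 + (1-q)*p0.
info_unlabeled = integrate(
    lambda x: (p1(x) - p0(x)) ** 2 / (q * p1(x) + (1.0 - q) * p0(x))
)
# Fisher information about q in one labeled sample, where Y ~ Bernoulli(q).
info_labeled = 1.0 / (q * (1.0 - q))
# Total variation distance V(p1, p0) = 1 - integral of min(p0, p1).
v = 1.0 - integrate(lambda x: min(p0(x), p1(x)))
```

For this example `info_unlabeled` does not exceed `v * info_labeled`, which is the inequality of Remark 9 written in terms of the Fisher informations rather than their reciprocals.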